Deciding Indexing Strings with Statistical Analysis

Authors

  • Yoshiyuki Takeda
  • Kyoji Umemura
  • Eiko Yamamoto
Abstract

Deciding on indexing strings is important for information retrieval. Ideally, the strings should be the words that represent the documents or the query. Although single words are the natural first candidates for indexing strings in an English corpus, they may not be ideal because of compound nouns, which are often good indexing strings and which depend on the genre of the corpus. The situation is even worse in Japanese or Chinese, where words are not separated by spaces. In this paper, we propose a method for deciding indexing strings based on statistical analysis. The novel features of our method are that it makes the most of a statistical measure called adaptation and that it does not use language-dependent resources such as dictionaries and stop-word lists. We evaluated our method on a Japanese test collection and found that it improves the precision of information retrieval systems.
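
The abstract names adaptation but does not define it. One definition used in the literature is the probability that a string occurs at least twice in a document given that it occurs at least once, estimated as df2/df1, where df1 and df2 count the documents containing the string at least once and at least twice. The Python sketch below assumes that definition and uses fixed-length character n-grams as candidate strings, since Japanese text has no word boundaries. The function names and the `min_df` parameter are hypothetical, and this is an illustration rather than the authors' implementation.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def adaptation_scores(docs, n, min_df=1):
    """Estimate adaptation as df2 / df1 for every character n-gram.

    df1: number of documents containing the n-gram at least once.
    df2: number of documents containing the n-gram at least twice.
    `min_df` (an assumed parameter) drops n-grams seen in too few documents.
    """
    df1 = Counter()
    df2 = Counter()
    for doc in docs:
        counts = Counter(char_ngrams(doc, n))
        for gram, k in counts.items():
            df1[gram] += 1
            if k >= 2:
                df2[gram] += 1
    return {g: df2[g] / df1[g] for g in df1 if df1[g] >= min_df}


if __name__ == "__main__":
    docs = ["abab", "abcd", "xyz"]
    scores = adaptation_scores(docs, n=2)
    # Highest-scoring n-grams first.
    for gram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(gram, score)
```

Under this assumed definition, strings that tend to repeat within a document once they appear score high, which is the intuition for treating them as content-bearing indexing candidates. How the paper turns such scores into a final choice of indexing strings is not stated in the abstract.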

Similar resources

QGramProjector: Q-Gram Projection for Indexing Highly-Similar Strings

Q-gram (or n-gram, k-mer) models are used in many research areas, e.g. in computational linguistics for statistical natural language processing, in computer science for approximate string searching, and in computational biology for sequence analysis and data compression. For a collection of N strings, one usually creates a separate positional q-gram index structure for each string, or at least ...

Probabilistic Threshold Indexing for Uncertain Strings

Strings form a fundamental data type in computer systems. String searching has been extensively studied since the inception of computer science. Increasingly many applications have to deal with imprecise strings or strings with fuzzy information in them. String matching becomes a probabilistic event when a string contains uncertainty, i.e. each position of the string can have different probable...

Compressed and Searchable Indexes for Highly Similar Strings (Invited Talk)

The collection indexing problem is defined as follows: Given a collection of highly similar strings, build a compressed index for the collection of strings, and when a pattern is given, find all occurrences of the pattern in the given strings. Since the index is compressed, we also need a separate operation which retrieves a specified substring of one of the given strings. Such a collection of ...

A Generalized Approach for Image Indexing and Retrieval Based on 2-D Strings

2-D strings is one of a few representation structures originally designed for use in an IDB environment. In this paper, we propose a generalized approach for 2-D string based indexing which avoids the exhaustive search through the entire database of previous 2-D strings based techniques. The classical framework of representation of 2-D strings is also specialized to the cases of scaled and unsc...

Alphabet Indexing by Cluster Analysis: A Method for Knowledge Acquisition from Amino Acid Sequences

Knowledge acquisition has been an important topic in Artificial Intelligence and a variety of contributions have been made in various fields where computers can be applied. Genome Informatics is one of the most attractive fields for which knowledge acquisition techniques are strongly expected. In [3] a knowledge acquisition system for sequence data has been developed and has shown successful experime...

Publication date: 2002